Introduction

This is the 4th project of Udacity’s Data Analyst NanoDegree Program. We were given several data sources as options to analyze from. I chose the Arizona’s 2016 Presidential Campaign Finance from the Federal Election Commission (FEC) website.

The format of this analysis will be as the following:

1- I will declare my intentions with a hypothesis (if applicable)

2- Insert a R snippet/code and run it

3- Declare my findings.

And so on…

First I will begin the analysis by exploring basic statistics about the data set. This will help me see the nature of the data, and whether the data needs cleaning or wrangling. Afterwards, I will explore variable and multivariate relationships, by using the methods I have learned in chapter 4, such as scatter, line, box plots and histograms. This is the basic outline of the analysis, but surely I will find interesting things to talk about along the way.

Data Preparation and Munging

The structure of the data is as the following The file has 19 variables, and these are the most important ones to for the analysis:

I wish if there was a party and a gender column, I will try to it below.

I also would like to add other information such as latitudes and longitudes for map analysis

Below I have standardized the zipcodes to a 5 digit zipcodes and added the to a new column called ‘clean_zip’. These standardized zipcodes will be used in my analysis afterwards.

I will add candidate gender column below

# Add Gender col to candidates

# Gender indices
m_index <- c('Bush, Jeb', "Carson, Benjamin S."
            , "Christie, Christopher J", "Cruz, Rafael Edward 'Ted'",
            "Gilmore, James S III" ,
            "Graham, Lindsey O.", "Huckabee, Mike", 
            "Jindal, Bobby", "Kasich, John R.",
            "Paul, Rand", "Perry, James R. (Rick)",
            "Rubio, Marco", "Trump, Donald J.",
            "Walker, Scott", "Sanders, Bernard",
            "Lessig, Lawrence", "O'Malley, Martin Joseph",
            "Webb, James Henry Jr.", "Johnson, Gary",
            "McMullin, Evan", "Christie, Christopher J.", 'Santorum, Richard J.'    
            )

f_index <- c("Clinton, Hillary Rodham", "Fiorina, Carly", "Stein, Jill" )



#simple cand_gender
az$cand_gender <- NA
attach(az)
az$cand_gender[cand_nm %in% m_index] <- "Male"
az$cand_gender[cand_nm %in% f_index] <- "Female"
detach(az)
# convert cand_gender to factor
az$cand_gender <- factor(az$cand_gender)

Now that I added candidates’ genders, I’ll add the contributors’ genders, by using the gender package.

Exploratory Data Analysis

Below I will explore the top occurrences (counts) of the following:

1- Cities 2- Candidates 3- Contributors

Now I would like to explore the distribution by sum of money contributed.

The above dataframe has made me curious as why these particular total amounts were contributed. After some research I have found some interesting findings, regarding the most common total amount ( $2,700) according to the Federal Election Commission’s website (https://transition.fec.gov/pages/brochures/citizens.shtml), it states that “An individual may give a maximum of:

$2,700 per election to a federal candidate or the candidate’s campaign committee."

Regarding the 3rd most common amount (5,400) the website states “A husband and wife each have separate contribution limits, even if only one spouse has an income. For example, a couple may contribute a $5,400 check to a candidate’s primary campaign as long as both sign the check.”

The point that I am trying to make here is we can use this dataframe to infer what kind of contributors we have in the dataset, such as couples, singles, fundraisers, corporate and etc..

I would like to know how the data is distributed in terms of candidate and contributor gender count in the dataset.

The plot on the left shows the count of contributions for each party, and the plot on the right shows the sum of contributions to each party.

It appears that we have negative numbers, that goes all to -5400. I believe that it represents refunds, since the most receipt comment is refund.

I want to find out the amount stats without the refunds.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     0.04    15.00    27.00    80.21    61.86 10800.00
##  'table' int [1:2224(1d)] 1 1 2 2 1 25 1 861 1 1 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ non_zero_rec: chr [1:2224] "0.04" "0.12" "0.24" "0.5" ...
## non_zero_rec
##     3     5     8    10    15    19    20    25    27    28    35    38 
##  1126  7727  1869 11486  5699  2539  2638 17543  5432  2258  2322  1014 
##    40    50    75    80   100   200   250   500 
##  2340 13346  1325  2038 11547  1941  4095  1380

I wonder why the odd numbers such as 19, 27 or even 38.

This video from the Sanders campaign intrigued me (https://www.youtube.com/watch?v=3nJhuVBXlXk), it claims that the average contribution for the Sanders campaign was $27. Meaning that most Bernie supporters are low income workers.

I would like to explore what is the median contribution received for each candidate. To be fair, I have excluded candidates that had less than 400 contributions from the dataset.

# Idea: see the average of each candidate contriubution, think: Bernie's $27 campain
avg_amt_cand <- az[az$contb_receipt_amt > 0, ] %>% group_by(cand_nm) %>%
  summarise(avg = mean(contb_receipt_amt),
            median = median(contb_receipt_amt),
            n= n())


avg_amt_cand
ggplot(avg_amt_cand[avg_amt_cand$n > 400,],
       aes(cand_nm, median))+
  geom_bar(stat = 'identity')+
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Interestingly, the highest median contributions were given to republican candidates. Meanwhile, both Clinton and Sanders had the least median of contribution amounts.

Below I would like to first discover the date distribution of contributions.

The second graph demonstrates the volume of distributions per candidate along the dates. From the graph we can see when did each campaign gain or lose it’s momentum, also it is apparent when the candidates dropped out of the race.

The above plot shows the cumulative of amount contributed and number of contributions along the dates, breaked down by party. Republicans have contributed slightly more in total, but the number of democratic contributions significantly outweighs the number of republican contributions.

Below we will see the number of contributions for each candidate and how they break out:

I would like to see the box-plot of each gender/party contribution. Republicans contributed more on average, and they had a higher range of contribution amounts. Male republicans contributed slightly more on average than their female counterparts.

## 
## female   male 
##  65569  54083

I did not expect to find more female contributors than males in this data-set.

Lets explore if females were more likely to vote for female candidates.

## [1] 0.546188
## [1] 0.6540687

##         
##          Female  Male
##   female  35813 29756
##   male    18709 35374

%54.5 females of this data-set contributed to females, while %65.4 of males contributed to males, which is a negligible preference. The graph above demonstrates that preference.

I would like to explore job information, and how that plays into the dataset.

One major problem of this dataset is the occupation fields has a variety of problems, such as miss-spellings and duplicates of the the same job title (‘EX. ADMIN.’ & ‘EXEC. ADMIN.’). These types of problems will definitely skew my findings. As an attempt to solve this problem, I decided to resort to Altryx’s Fuzzy Match tool, which matches up with fields that have the same resemblance. It is not perfect by no means, but it does the job of consolidating alias fields (with different namings). It decreased the number of job titles from 4955 to 539. The model is somewhat underfitted, as it has a standardized list of job titles extracted from the main database by selecting jobs that have 20 or more entries in the database (539 jobs). Based on that list I used fuzzy matching on the whole database to fit those 539 jobs. As a result 13,873 job records has been standardized.

1- Export az file to alteryx 2- Run the workflow 3- Import standardized jobs dataframe here

# export file
require(xlsx)
## Loading required package: xlsx
## Loading required package: rJava
## Error: package or namespace load failed for 'rJava':
##  .onLoad failed in loadNamespace() for 'rJava', details:
##   call: fun(libname, pkgname)
##   error: JAVA_HOME cannot be determined from the Registry
write.csv(az, "az_mod1.csv")

read.csv('az_mod1.csv')
# import altryx output
az_std_jobs <- read.csv('STD Jobs 2.0 .csv')

az_std_jobs$contbr_occupation <- az_std_jobs$Right_contbr_occupation
length(unique(az$contbr_occupation))
## [1] 4303
# most genrous occupations and account of total 
job_n_avg <- az_std_jobs %>% group_by(contbr_occupation) %>%
  summarise(n= n(), avg =mean(contb_receipt_amt)) %>%
  arrange( -avg )






#plot most contributed on average
ggplot(job_n_avg[job_n_avg$n >110, ][1:10,],
       aes(contbr_occupation, avg)
)+
  geom_bar(stat= 'identity')+
theme(axis.text.x=element_text(angle=90,hjust=1))+
  ggtitle('most contributed on average')

# least contributed on avg

least_contrib_jobs <- arrange(job_n_avg[job_n_avg$n > 200, ], avg)
least_contrib_jobs$contbr_occupation <- factor(least_contrib_jobs$contbr_occupation,
                                          levels =  least_contrib_jobs$contbr_occupation)

ggplot(least_contrib_jobs[1:30,],
             aes(contbr_occupation, avg)
)+
  geom_bar(stat= 'identity')+
theme(axis.text.x=element_text(angle=90,hjust=1))+
  ggtitle('least contributed on average (jobs with more than 200 entries)')

I am curious to see if there is a political lean in academia.

academia majority percentage:

Around %80 of colleges had a democratic preference. The majority of %59 of contributions were for Clinton, Sanders comes in second of %37. Cruz came in third (%1.42) and Trump close fourth (%1.4). Please note that these statistics has been obtained by using the mode of each party and candidate in each university.

Below I will find the stats of homemakers.
##      female        male           N 
##   96.121097    3.878903 1057.000000
##    democrat independent  republican           N 
##   65.657521    1.324503   33.017975 1057.000000

Though not surprising, 96% of homemakers in this dataset are females.

66% of homemakers in the dataset are democrats, while 33% are republicans.

Retirees play a huge role in this dataset, lets find out how.

## [1] "the retirees in the data are: "
## [1] 35.61638
##      democrat   independent    republican             N 
##    59.8303917     0.4464955    39.7231129 33371.0000000
## [1] 81.37492

As we can see above, retirees make up about %35.6 of the data-set. Around %60 of retirees contributed to democrats and around %40 percent to republicans, contributions to independents are negligible. Retirees contributed $81 on average.

I want to know which occupations are most politically active, and how do they lean politically.

##                        [,1]
## RETIRED               32749
## NOT EMPLOYED          13737
## INFORMATION REQUESTED  3214
## ATTORNEY               2107
## PHYSICIAN              1897
## TEACHER                1821
## ENGINEER               1389
## CONSULTANT             1272
## PROFESSOR              1239
## SALES                  1238

The most politically active occupations in the data set are attorneys, physicians then teachers.

Below I would like to know the proportions of party leaning for each job. For example, of all engineers how many percent of them lean republican (number of republican engineers/ total number of engineers).

For the republican bar chart, most jobs are represented from the private sector, such as CEOs, executives, real estate and business owners.

The democratic bar chart lacks variability as most of the jobs are 98% and above represent democrats. Most of the jobs are from the public sector such as social workers, educators, teachers and professors.

I would like to see average spending along dates

## integer(0)

## integer(0)

Now I would like to research the data geographically. I used the ‘choroplethr’ library, which was very useful to my analysis.

Above is a heatmap representing the total spend of each zipcode within Maricopa county. As we can see most contributions are from the inner city of Phoenix, with some exceptions. Though I have to note that the scales of the plots are not perfect.

I relied on df_zip_demographics as an outer source for the below plots, and I have merged them to my df, which helped me explore new findings.

I am wondering what kind jobs contributed to Donald trump, my intuition says it’s mostly blue collar jobs. Let’s find out!

My hypothesis is false, most of trumps contributors have white collar jobs, even the higher income types such as engineers, consultants, physicians and CEOs.

let me see by number of contributions only if it helps me find out more,

## [1] 0.1344482

The percentage of contributions for trump of the whole data-set is 18%.

By changing some of the subset filters, still the majority were high income occupations, even though we have some blue collar jobs such as truck driver and construction, but they were the minority. My hypothesis is blue-collar workers cannot afford to contribute therefore, they are underrepresented in this data-set.

I want to see if higher income zip-codes had more contributions and I will use a scatter-plot to demonstrate. There is only a strong relationship When I subsetted the data to 100 contributions at least per zip-code. Doing otherwise will skew the data and the relationship will not be apparent.

This is obvious but still I would like to see the relationship between number of contributions and population of zip-code.

There is a strong correlation at first, but then as population increases the relationship weakens.

Final Plots:

It seems that the average amount of contributions were huge at the beginning of 2015, but when I added a 4th variable (n = number of contributions) it shows that these were a few outliers, the mass of the contributions came in mid 2016 as it lowered the average but the size (number of contributions) were bigger substantially.

The above plot demonstrates the relationship between number of contributions vs population by zipcode. As expected as the population increased, the number of contribution increased. Although, the curve was steepest from coordinates (0,0) until (250, 20000) then the curve stabilized throughout the rest of the plot. Which means that the relationship was strongest initially, but after the (250, 20000) coordinate threshold the relationship weakened between population and number of contributions.

The above plot was made by an outer source. Although it is not critical to my analysis, I found it interesting. The plot demonstrates the relationship between median income and total population per zipcode. We can safely infer that as the population density increases in a zipcode, the rent consequently increases, therefore people need to have a higher income to be able to afford living in higher population zipcodes.

The above graph shows the relationship between average amount contributed vs. median income (USD) per zipcode. There is a weak relationship from 0 on the x-axis until around 40,000 on the x-axis the relationship began to become stronger, and the curve became steeper. Though I have to note that the latter might be caused due to the right-tail outliers.

This plot shows the average contribution (USD) per county. The range of the highest and lowest average is $33 (68-35).

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

## TableGrob (2 x 1) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (2-2,1-1) arrange gtable[layout]

One of the first things that caught my eye is that the republican average contribution has a much higher disparity than the democrat plot, 94.8 vs. 32.3, respectively.

Reflection:

overall this project was a good challenge and learning experience. At first it was easy and enjoyable exploring the data, as I went deeper into the analysis it became harder to come up with relationships and conclusions about the data. I wanted my analysis to have a central theme/thesis, the fact of not drawing a certain conclusion made me feel frustrated.

I was impressed with the versatility of R, and its packages, I felt like it was more intuitive than python, maybe because I have a background with Alteryx. Although, R felt like it had less support on stackoverflow than python, but there’s support nonetheless, which aided me significantly throughout the project. I also used Datacamp for filling in the knowledge gaps and reinforcing the concepts learned in the Udacity curriculum. I have not utilized Udacity’s live help as much as the other projects, because I did not face problems with programming itself, rather than loss of ideas and direction of my analysis.

This project was a great opportunity to reinforce the skills I learned through Udacity’s business analyst nanodegree such as using Alteryx.

In terms of visualizations, R is fantastic for data exploration, although it is lacking the ability to export high resolution plots. I feel that Tableau is more suitable for findings/conclusive plots.